Application of referenced thermodynamic integration to Bayesian model selection

Evaluating normalising constants is important across a range of topics in statistical learning, notably Bayesian model selection. However, in many realistic problems this involves the integration of analytically intractable, high-dimensional distributions, and therefore requires the use of stochastic methods such as thermodynamic integration (TI). In this paper we apply a simple but under-appreciated variation of the TI method, here referred to as referenced TI, which computes a single model’s normalising constant in an efficient way by using a judiciously chosen reference density. The advantages of the approach and theoretical considerations are set out, along with pedagogical 1 and 2D examples. The approach is shown to be useful in practice when applied to a real problem —to perform model selection for a semi-mechanistic hierarchical Bayesian model of COVID-19 transmission in South Korea involving the integration of a 200D density.


Introduction
The marginalised likelihood, or normalising constant of a model, is a feature central to the principles and pathology of Bayesian statistics. For example -given two models representing two competing hypotheses, the ratio of the normalising constants (known as the Bayes factor), describes the relative probability of the data having been generated by one hypothesis compared to the other. Consequently, at a practical level the estimation of normalising constants is an important topic for model selection in the Bayesian setting [1].
In practice, estimating a normalising constant relies on "integrating out" or marginalising the parameters of the model to get the probability the associated hypothesis produced the data. But in general this is difficult, because we cannot easily integrate arbitrary highdimensional distributions -certainly analytical or quadrature-based methods are of little help directly. As a result, practitioners turn to a range of approaches, typically based on statistical sampling. Specific examples include bridge sampling [2,3], stochastic density of states based methods [4,5] and thermodynamic integration [6][7][8][9]. This work focuses on the latter -thermodynamic integration -in particular on the development of the practical details for efficient application for Bayesian model selection. By way of introduction for researchers unfamiliar with thermodynamic integration (TI), it allows us to estimate the ratio of two normalising constants in a general and asymptotically exact way. Instead of marginalising the associated densities explicitly in terms of the highdimensional integrals, using TI we only have to evaluate a 1-dimensional integral, where the integrand can easily be sampled with Markov Chain Monte Carlo (MCMC). To see how this works, consider two models labelled 1 and 2 with normalising constants z 1 and z 2 . Each z is given by where q i is a density for model M i with parameters θ, that gives the model's Bayesian posterior density as To apply thermodynamic integration we introduce the concept of a path between q 1 (θ) and q 2 (θ), linking the two densities via a series of intermediate ones. This family of densities, parameterised by the path coordinate λ, is denoted by q(λ;θ). An example path in λ is shown in Fig 1A. The density q(λ;θ), linking q 1 to q 2 and defining the intermediate densities, can be chosen to have an optimal or in some way convenient path. A common choice based on convenience is the geometric one qðl; θÞ ¼ q l 2 ðθÞq 1À l 1 ðθÞ ; l 2 ½0; 1� : The important point to note is that for λ = 0, q(λ;θ) returns the first density q(0;θ) = q 1 (θ), for λ = 1 it gives q(1;θ) = q 2 (θ), and for in-between λ values a log-linear mixture of the endpoint densities. Just as we have defined a family of densities, there is an associated normalising constant for any point along the path, that for any value of λ is given by A further small but important point to avoid complications is to have densities with common support, for example O(λ = 1) = O(λ = 0). Hereafter support is denoted by O.
Having set up the definitions of q(λ;θ) and z(λ), the TI expression can be derived, to compute the log-ratio of z 1 = z(λ = 0) and z 2 = z(λ = 1), while avoiding explicit integrals over the log differentiation rules again (fourth step), identifying the expectation E qðl;θÞ from sampling distribution q(λ;θ) (fifth step), differentiation of the geometric path for q(λ) (sixth step), and finally equivalence of sampling from q and p. The final line in the expression summarises the usefulness of TI. Instead of having to work with the complicated high-dimensional integrals of Eq 1 to find the log-Bayes factor log z 2 z 1 , which measures the relative probability of getting the data from one hypothesis compared to another, we only need to consider a 1-dimensional integral of an expectation, and that expectation can be readily produced by MCMC.
In our paper we examine the details of a referenced TI approach, which is a variation on the TI theme that we find useful to enable fast and accurate normalising constant calculations. Our main contributions are as follows: • We show how to generate a reference normalising constant from an exactly-integratable reference density, through sampling or gradients, and with parameter constraints. And we present how to use this reference in the TI method to efficiently estimate a normalising constant of an arbitrary high-dimensional density.
• We discuss performance benchmarks for a well-known problem in the statistical literature [10], which shows the method performs favourably in terms of accuracy and the number of iterations to convergence.
• Finally the technique is applied to a hierarchical Bayesian time-series model describing the COVID-19 epidemic in South Korea.
In relation to other work, we recognise using a reference for thermodynamic integration is a topic that has been raised, especially in early theoretically-oriented literature [11][12][13]. Our additional contribution is to bridge the gap from theory and simple examples to application, which includes choosing the reference using MCMC samples or gradients, examination of reference support, comparisons of convergence, and illustration of the approach for a non-trivial real-world problem.

Referenced TI
Introducing a reference density and associated normalising constant as yields an efficient approach to compute Bayes factors, or more generally to marginalise an arbitrary density for any application. To clarify notation, here z is the normalising constant of interest with density q, z ref is a reference normalising constant with associated density q ref .
In the second line the ratio z/z ref is straightforwardly given by the thermodynamic integral identity in Eq 2.
While the Eq 2 can be directly applied to conduct a pairwise model comparison between two hypotheses, by introducing a reference we can naturally marginalise the density of a single model [11,12]. This is useful when comparing multiple models as n > n 2 À � for n > 3. Another motivation to reference the TI is the MCMC computational efficiency of converging the TI expectation. In Eq 3, with judicious choice of q ref , the reference normalising constant z ref can be evaluated analytically and account for most of z. In this case log qðθÞ q ref ðθÞ tends to have a small expectation and variance and converges quickly. This idea of using an exactly solvable reference, to aid in the solution of an otherwise intractable problem, has been a recurrent theme in the computational and mathematical sciences in general [14][15][16][17], and variations on this approach have been used to compute normalising constants in various guises in the statistical literature [8,9,11,[18][19][20][21][22][23][24]. For example, in the generalised stepping stone method a reference is introduced to speed up convergence of the importance sampling at each temperature rung [23,24]. In the work of [22] a theoretical discussion has been presented that shows the error budget of thermodynamic integration depends on the J-divergence of the densities being marginalised. Noting this, [19] provide an illustration for a 2-dimensional example in their work on recursive pathways to marginal likelihood estimation. And in the power posteriors method, a reference is used but the reference is a prior density and thus z ref = 1 [20]. This approach is elegant as the reference need not be chosen -it is simply the prior -however the downside is that for poorly chosen or uninformative priors, the thermodynamic integral will be slow to converge and susceptible to instability. In particular for complex hierarchical models with weakly informative priors this is found to be an issue.
For referenced TI as presented here, the reference density in Eq 3 can be chosen at convenience, but the main desirable features are that it should be easily formed without special consideration or adjustments and that z ref should be analytically integratable and account for as much of z as possible. Such a choice of z ref ensures the part with expensive sampling is small and converges quickly. An obvious choice in this regard is the Laplace-type reference, where the log-density is approximated with a second-order one, for example a multivariate Gaussian. For densities with a single concentration, Laplace-type approximations are ubiquitous, and an excellent natural choice for many problems. In the following section we consider approaches that can be used to formulate a reference normalising constant z ref from a second-order logdensity (though more generally other tractable references are possible). In each referenced TI scenario, we note that even if the reference approximation is poor, the estimate of the normalising constant based on Eq 3 remains asymptotically exact-only the speed of convergence is affected (subject to the assumptions of matching support for end-point densities).

Taylor expansion at the mode Laplace reference
The most straightforward way to generate a reference density is to Taylor expand the log-density to second order about a mode. Noting no linear term is present, we see the reference density is where H is the Hessian matrix and θ 0 is the vector of mode parameters. The associated normalising constant is ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi This approach to yield a reference density, using either analytic or finite difference gradients at mode, tends to produce a density close to the true one in the neighbourhood of θ 0 . But this is far from guaranteed, particularly if the density is asymmetric, or has non-negligible high-order moments, or is discontinuous for example exhibiting cusps. In many instances a more reliable choice of reference can be found by using MCMC samples from the whole posterior density.

Sampled covariance Laplace reference
A second straightforward approach to form a reference density, that's often more robust, is by drawing samples from the true density q(θ) to estimate the mean parametersθ and covariance matrixΣ, such that Then the reference normalising constant is ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi detð2pΣÞ This method of generating a reference is simple and reliable. It requires sampling from the posterior q(θ) so is more expensive than gradient-based methods, but the cost associated with drawing enough samples to generate a sufficiently good reference tends to be quite low. In the primary application discussed later, regarding structured high-dimensional Bayesian hierarchical models, we use this approach to generate a reference density and normalising constant.
Though the sampled covariance reference is typically a good approach, it is not in general optimal within the Laplace-type family of approaches -typically another Gaussian reference exists with different parameters that can generate a normalising constant closer to the true one, thus potentially leading to overall faster convergence of the thermodynamic integral to the exact value. Such an optimal reference can be identified variationally, as we show in S1 Appendix

Reference support
If a model involves a bounded parameter space, for example θ 1 2 [0, 1), θ 2 2 (−1, 1) etc. as commonly arise in structured Bayesian models, in referenced TI the exact analytic integration for the reference density should be commensurately limited. This is necessary not only so the reference is closer to the true density to speed up convergence, but also so MCMC samples from both densities can be drawn on the same parameter space, as is required for the thermodynamic integrand in Eq 3 to be well-defined. However, the calculation of arbitrary probability density function orthants (an orthant is a specific region in a multi-dimensional space or more precisely a bounded n-dimensional space), even for well-known analytic functions such as the multivariate Gaussian, is in general a difficult problem. High-dimensional orthant computations usually require advanced techniques, the use of approximations, or sampling methods [25][26][27][28][29][30]. Fortunately, we can simplify our reference density to create a reference with tractable analytic integration for limits by using a diagonal approximation to the sampled covariance or Hessian matrix. For example the orthant of a diagonal multivariate Gaussian can be given in terms of the error function [31], ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi where K denotes the set of indices of the parameters with lower limits a i . S diag is a diagonal covariance matrix, that is one containing only the variance of each of the parameters, without the covariance terms and Σ diag i denotes the i th element of the diagonal. Restricting our density to a diagonal one is a poorer approximation than using the full covariance matrix. In practice however this has not been a substantial drawback to the convergence of the thermodynamic integral-and again we state that the quality of the reference affects only convergence rather than eventual accuracy of the normalising constant. This behaviour is observed in the practical examples later considered, though we do recognise the distinction between accuracy and convergence and matters of asymptotic consistency using an MCMC estimator with finite iterations are not clear cut.

Technical implementation
Referenced TI was implemented in Python and Stan programming languages. Using Stan enables fast MCMC simulations with Hamiltonian Monte Carlo and No-U-Turn algorithm [32,33], and portability between other statistical languages. We also provide an example of carrying out referenced-TI in NumPyro [34]. The code for all examples shown in this paper is available at https://github.com/mrc-ide/referenced-TI. In the examples shown in Section Applications, we used 4 chains with 20,000 iterations per chain for the pedagogical examples, and 4 chains with 2,000 iterations for the other applications. In all cases, half of the iterations were used for the burn-in (alternatively warm-up [33]). Mixing of the chains and the sampling convergence was checked in each case, by ensuring that theR value was �1.05 and investigating the trace plots. TheR is a standard MCMC diagnostic that assesses the convergence of multiple parallel chains running in an MCMC simulation. It is computed by comparing the variance between chains to the variance within each chain. If chains have not fully converged, their variances will be larger compared to when they have reached convergence [35].
In all examples, the integral given in Eq 2 was discretised to allow computer simulations.

Applications
In this section we present applications of the referenced TI approach. In 1D Pedagogical Example and 2D Pedagogical Example with Constrained Parameters we give 1-and 2-dimensional pedagogical introductions to the approach. In Benchmarks-Radiata Pine we select a linear regression model for a well-known problem in the statistical literature, and finally in in Model Selection for the COVID-19 Epidemic in South Korea we consider a challenging model selection task for a structured Bayesian model of the COVID-19 epidemic in South Korea.

1D pedagogical example
To illustrate the technique we consider the 1-dimensional density ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi jy À 4j with normalising constant z ¼ R 1 À 1 qðyÞdy. This density has a cusp and it does not have an analytical integral that easily generalises to multiple dimensions.
In this instance the Laplace approximation based on the second-order Taylor expansion at the mode will fail due to the cusp, so we use the more robust covariance sampling method. Sampling from the 1D density q(θ) we find a variance ofŝ consider optimal paths-satisfactory convergence is easily achieved using a simple geometric path with 4 λ-intervals.

2D pedagogical example with constrained parameters
As a second example, consider a 2-dimensional unnormalised density function with a constrained parameter space: and y 1 2 ½0; þ1Þ and y 2 2 ðÀ 1; þ1Þ : A reference density q ref (θ) can be constructed from the Hessian at the mode of q(θ). To marginalise it on a parameter space with restricted support, we use a reference density q diag ref ðyÞ based on a diagonal Hessian, that has an exact and easy to calculate orthant. All densities are shown in Fig 2. To obtain the log-evidence of the model, we calculated the exact value numerically [36,37], and using a sampled diagonal covariance matrix, as per Eq 8, to account for the lower bound of the parameter θ 1 . Without this restriction the final normalising constant is overestimatedif the support of the parameters in the MCMC is not the same as for the analytic z ref calculation, z ref as shown in Eq 3 does not cancel with the TI reference. Numerical comparison of the referenced TI to quadrature is presented in the S2 Appendix.

Benchmarks-Radiata Pine
To benchmark the application of the referenced TI in the model selection task, two non-nested linear regression models are compared for the radiata pine data set [10]. This example has been widely used for testing normalising constant calculating methods, since in this instance the exact value of the model evidence can be computed. The data consists of 42 3-dimensional data-points, expressed as y i -maximum compression strength, x i -density and z i -density adjusted for resin content. In this example, we follow the approach of [21], using the priors from therein, and test which of the two models M 1 and M 2 provides better predictions for the compression strength: ; t À 1 Þ; i ¼ 1; :::; n ; ; r À 1 Þ; i ¼ 1; :::; n : Five methods of estimating the model evidence were used in this example: Laplace approximation using a sampled covariance matrix, model switch TI along a path directly connecting the models [8,38], referenced TI, power posteriors with equidistant 11 λ-placements (labelled here as PP 11 ) and power posteriors with 100 λ-s (PP 100 ), following the example from [21]. For the model switch TI, referenced TI and PP 11 Fig 3A. We can see from the plot, that referenced TI presents favourable convergence to the exact value, whereas PP 100 oscillates around it. Fig 3B shows the distribution of log-evidence for the same model generated by 15 runs of the three algorithms: Laplace approximation with sampled covariance matrix, referenced TI and PP 100 . Similar figures for model M 1 are given in the S2 Appendix. Although all three methods resulted in a log-evidence satisfactorily close to the exact solution, referenced TI was the most accurate and importantly, converged fastest (308 MCMC draws compared to 55,000 draws needed for the power posterior method to achieve standard error of 0.5%, excluding burn-in, see S2 Appendix).

Model selection for the COVID-19 epidemic in South Korea
The final example of using referenced TI for calculating model evidence is fitting a renewal model to COVID-19 case data from South Korea. The data were obtained from opendata.ecdc. europa.eu/covid19/casedistribution/csv. The model is based on a statistical representation of a stochastic branching process whose expectation mechanistically follows a renewal-type equation. Its derivation and details are explained in [39] and a short explanation of the model is provided in the S2 Appendix. Briefly, the model is fitted to the time-series case data and estimates a number of parameters, including serial interval and the effective reproduction number, R t . The number of cases for each day are modelled by a negative binomial likelihood, with location parameter estimated by a renewal equation. Three modifications of the original model are tested here: • variation of the infection generation interval for values GI = 5, 6, 7, 8, 9, 10, 20-where GI denotes the mean of Rayleigh-distributed generation interval, • changing the order of the autoregressive model for the reproduction number, for AR(k) with k = 2, 3, 4 lags, • varying the length of the sliding window for estimating the reproduction number for values in W = 1, 2, 3, 4, 7 days.
Within each group of models, GI, AR and W, we want to select the best model through the highest evidence method. The dimension of each model was dependent on the modifications applied, but in all the cases the normalising constant was a 40-to 200-dimensional integral. The log-evidence of each model was calculated using the Laplace approximation with a sampled covariance matrix, and then correction to the estimate was obtained using referenced TI method. Values of the log-evidence for each model calculated by both Laplace and referenced TI methods are given in Table 1. Interestingly, the favoured model in each group, that is the model with the highest log-evidence, was different when the evidence was evaluated using the Laplace approximation than when it was evaluated with referenced TI. For example, using the Laplace method, sliding window of length 7 was incorrectly identified as the best model, whereas with referenced TI window of length 2 was chosen to be the best among the tested sliding windows models, which agrees with the previous studies of the window-length selection in H1N1 influenza and SARS outbreaks [40]. This exposes how essential it is to accurately determine the evidence, even good approximations can result in misleading results. Bayes factors for all model pairs are shown in the S2 Appendix. Table 1. Log-evidence estimated by Laplace and referenced TI approximations. In each section, model with the highest log-evidence estimated by Laplace or referenced TI method is indicated in bold. The credible intervals for logevidence comes from calculating the quantiles of the integral from Eq 2, where the integral values were obtained from the spline interpolated using running means of the expecations per λ over all iterations.

Model
Log

Interpretation of the COVID-19 model selection
The importance of performing model selection in a rigorous way is clear from Fig 4, where the generated R t time-series are plotted for the models favoured by Laplace and referenced TI methods (additional posterior densities are shown in the S2 Appendix. The differences in the R t time-series show the pitfalls of selecting an incorrect model. The differences between the two favoured models were most extreme for the GI = 8 and GI = 20 models. While a GI = 8 is plausible, even likely for COVID-19, GI = 20 is implausible given observed data [41]. This is further supported by observing that for GI = 20, favoured by the Laplace method, R t reached the value of over 125 in the first peak-around 100 more than for the GI = 8. The second peak was also largely overestimated, where R t reached a value of 75. We find it interesting to note that all models present a similar fit to the confirmed COVID-19 cases data (see S2 Appendix). This makes it impossible to select the best model through visual inspection and comparison of the model fits, or by using model selection methods that do not take the full posterior distributions into account. Although the models might fit the data well, other quantities generated, which are often of interest to the modeller, might be completely incorrect. Moreover, it emphasises the need to test multiple models before any conclusion or inference is undertaken, especially with the complex, hierarchical models.
Although often the Laplace approximation of the normalising constant is sufficient to pick the best model, it was not the case in this epidemiological model selection problem. We can see in Table 1 that the evidence was the highest for the "boundary" models when Laplace approximation was applied. For example, for the sliding window length models, when the Gaussian approximation was applied, the log-evidence was monotonically increasing with the value of W within the range of values that seem reasonable (W = 1 to 7). In contrast, with referenced TI, the log-evidence geometry is concave within the range of a priori reasonable parameters.

Discussion
The examples shown in Section Applications illustrate the applicability of the referenced TI approach for calculating model evidence. In the radiata pine example, referenced TI performed better than the other tested methods in terms of accuracy and speed. When using referenced TI, at λ = 0 values are sampled from the reference density rather than the prior as in the power posterior method, which should be closer to the original density (in the sense of Kullback-Leibler or Jensen-Shannon divergence). This leads not only to a more accurate estimate of the normalising constant, but also much faster convergence of the MCMC samples. A detailed theoretical characterisation of rates of convergence is beyond the scope of this article, nonetheless the empirical tests presented consistently show faster convergence than with comparative approaches. This is useful especially for evaluating model evidence in complex hierarchical models where each MCMC iteration is computationally demanding.
Although referenced thermodynamic integration and other methods using path-sampling have theoretical asymptotically exact Monte Carlo estimator limits, in practice a number of considerations affect accuracy. For example, biases will be introduced to the referenced TI estimate if one endpoint density substantially differs from another. An example of this and explanation is included in the S3 Appendix.
Furthermore, the discretisation of the coupling parameter path in λ can introduce a discretisation bias. For the power posteriors method, [9] propose an iterative way of selecting the λplacements to reduce the discretisation error. [42] test multiple λ-placements for 2-and 20D regression models, and report relative bias for each tested scenario. In the referenced TI algorithm discretisation bias is however negligible -the use of the reference density results in TI expectations that are both small and have low variance, and therefore curvature with respect to λ. In our framework we use geometric paths with equidistant coupling parameters λ between the un-normalised posterior densities, but there are other possible choices of the path constructions, for example a harmonic [7] or hypergeometric path [38]. This optimisation might be worth exploring, however, as illustrated in Fig 3B, the expectations evaluated vs λ are typically near-linear with referenced TI suggesting limited gains, although the extent of this will differ from problem to problem.
In the application to the renewal model for the COVID-19 epidemic in South Korea, we showed that for a complex structured model, hypothesis selection by Laplace approximation of the normalising constant can give misleading results. Using referenced TI, we calculated model evidence for 16 models, which enabled a quick comparison between chosen pairs of competing models. Importantly, the evidence given by the referenced TI was not monotonic with the increase of one of the parameters, which was the case for the Laplace approximation. The referenced TI presented here will similarly be useful in other situations particularly where the high-dimensional posterior distribution is uni-modal but non-Gaussian.

Conclusions
Normalising constants are fundamental in Bayesian statistics. In this paper we give an account of referenced thermodynamic integration (TI), in terms of theoretical consideration regarding the choice of reference, and show how it can be applied to realistic practical problems. We show how referenced TI allows efficient calculation of a single model's evidence by sampling from geometric paths between the un-normalised density of the model and a judiciously chosen reference density -here, a sampled multivariate normal that can be generated and integrated with ease. Referenced TI method has practical utility for substantially challenging problems of model selection in epidemiology and we suggest it has applicability in other fields of applied machine learning that rely on high-dimensional Bayesian models.